R4DS 05 - Exploratory Data Analysis
The codes below are from the practice exercises in https://r4ds.had.co.nz/, and are taken with reference from: https://jrnold.github.io/r4ds-exercise-solutions/
Loading tidyverse package. The main packages used are dplyr and ggplot2.
EDA is carried out to generate questions about your data, to search for answers by visualizing, transforming and modelling your data, and to use the insights to further generate new questions.
Visualizing distributions: Categorical - Use Bar Chart
count_cut <- diamonds %>%
count(cut) %>%
arrange(desc(n))
count_cut
# A tibble: 5 x 2
cut n
<ord> <int>
1 Ideal 21551
2 Premium 13791
3 Very Good 12082
4 Good 4906
5 Fair 1610
diamonds %>%
ggplot(aes(x = cut)) +
geom_bar(fill = "coral") +
theme_classic()
Visualizing distributions: Numerical - Histogram
diamonds %>%
count(cut_width(carat, 0.5)) # cut_width() makes groups of width width.
# A tibble: 11 x 2
`cut_width(carat, 0.5)` n
<fct> <int>
1 [-0.25,0.25] 785
2 (0.25,0.75] 29498
3 (0.75,1.25] 15977
4 (1.25,1.75] 5313
5 (1.75,2.25] 2002
6 (2.25,2.75] 322
7 (2.75,3.25] 32
8 (3.25,3.75] 5
9 (3.75,4.25] 4
10 (4.25,4.75] 1
11 (4.75,5.25] 1
diamonds %>%
ggplot(aes(x = carat)) +
geom_histogram(binwidth = 0.5, fill = "deepskyblue4", col = "black") +
labs(title = "Histogram of carat size distribution in diamonds dataset",
subtitle = "Most of the diamonds have a carat value between 0.25 and 0.75") +
scale_x_continuous(n.breaks = 20) +
theme_classic()
A histogram divides the x-axis into equally spaced bins and then uses the height of each bar to display the number of observations that fall in each bin.
To overlay multiple histograms:
diamonds %>%
ggplot(aes(x = carat, color = cut)) +
geom_freqpoly(binwidth = 0.1) +
theme_classic()
Typical Values
diamonds %>%
filter(carat<3) %>%
ggplot(aes(x = carat)) +
geom_histogram(binwidth = 0.01, col = "coral") +
scale_x_continuous(n.breaks = 20) +
labs(title = "Histogram of diamond sizes < 3 carats.",
subtitle = "Most diamonds are smaller than 1 carat") +
theme_classic()
Unusual Values
diamonds %>%
ggplot(aes(x = y)) +
geom_histogram(aes(x = y), binwidth = 0.5, fill = "deepskyblue3") +
labs(title = "Histogram for Diamond Carat Sizes",
subtitle = "Using coord_cartesian() to reveal outlier values that might otherwise not be obvious.") +
scale_x_continuous(n.breaks = 20) +
coord_cartesian(ylim = c(0, 20)) +
theme_classic()
Explore the distribution of each of the x, y, z variables in diamonds dataset.
glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22…
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very…
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, …
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1…
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, …
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 33…
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87…
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78…
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49…
diamonds %>%
select(x,y,z) %>%
summary() # summary statistics
x y z
Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
Median : 5.700 Median : 5.710 Median : 3.530
Mean : 5.731 Mean : 5.735 Mean : 3.539
3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
Max. :10.740 Max. :58.900 Max. :31.800
# geom_histogram
hist_x <- diamonds %>%
select(x) %>%
ggplot(aes(x = x)) +
geom_histogram(binwidth = 0.01, fill = "deepskyblue4") +
labs(title = "Histogram for x",
subtitle = "x has a multimodal distribution and is of the range of 0 - 10.74") +
scale_x_continuous(n.breaks = 30) +
theme_classic()
hist_y <- diamonds %>%
select(y) %>%
ggplot(aes(x = y)) +
geom_histogram(binwidth = 0.01, fill = "deepskyblue4") +
labs(title = "Histogram for y",
subtitle = "y has a multimodal distribution and is of the range of 0 - 58.90. There seem to be outliers in this variable.") +
scale_x_continuous(n.breaks = 30) +
theme_classic()
(hist_z <- diamonds %>%
select(z) %>%
ggplot(aes(x = z)) +
geom_histogram(binwidth = 0.01, fill = "deepskyblue4") +
labs(title = "Histogram for z",
subtitle = "z has a multimodal distribution and is of the range of 0 - 31.8 There seem to be outliers in this variable.") +
scale_x_continuous(n.breaks = 30) +
theme_classic())
gridExtra::grid.arrange(hist_x, hist_y, hist_z, nrow = 3)
x is likely to be length, y is likely to be breadth and z is likely to be depth.
diamonds %>%
ggplot(aes(x, y)) +
geom_point() +
labs(title = "Scatterplot of x vs y") +
theme_classic()
diamonds %>%
ggplot(aes(x, z)) +
geom_point() +
labs(title = "Scatterplot of x vs z") +
theme_classic()
The plots above make it easier to detect outliers in relation to other variables.
Explore the distribution of price
diamonds %>%
select(price) %>%
summary()
price
Min. : 326
1st Qu.: 950
Median : 2401
Mean : 3933
3rd Qu.: 5324
Max. :18823
diamonds %>%
select(price) %>%
ggplot(aes(x = price)) +
geom_histogram(binwidth = 5)
diamonds %>%
select(price) %>%
ggplot(aes(x = price)) +
geom_histogram(binwidth = 5) +
labs(title = "Histogram for price distribution of diamonds",
subtitle = "There is a strange gap between $1000 and $2000") +
coord_cartesian(xlim = c(0, 3000)) +
theme_classic()
How many diamonds are 0.99 carat? How many are 1 carat?
diamonds %>%
filter(carat == 0.99) %>%
count() # 23
# A tibble: 1 x 1
n
<int>
1 23
diamonds %>%
filter(carat == 1) %>%
count()
# A tibble: 1 x 1
n
<int>
1 1558
# This could be due to the price differences.
Missing Values
It is recommended to replace outlier values with missing values using mutate() and ifelse().
diamonds_b <- diamonds %>%
mutate(y_edited = ifelse(y<3 | y >20, NA, y))
diamonds_b %>%
ggplot(aes(x, y_edited)) +
geom_point(na.rm = T)
Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is the visualize the relationship between two or more variables.
diamonds %>%
ggplot(aes(x = fct_rev(cut), y = price)) +
geom_boxplot() +
labs(title = "How price varies for different cuts",
subtitle = "Ideal cuts cost lower than fair cuts",
x = "Cut",
y = "Price") +
theme_classic() +
coord_flip()
Improve the visualization for departure times of cancelled vs non-cancelled flights
library(nycflights13)
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, …
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2,…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, …
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EW…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FL…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, …
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 20…
flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min/60
) %>%
ggplot(aes(x = cancelled, y = sched_dep_time)) +
geom_boxplot() +
labs(title = "Departure times of cancelled vs non-cancelled flights") +
theme_classic()
Which variable in the diamonds dataset is the most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
glimpse(diamonds)
Rows: 53,940
Columns: 10
$ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22…
$ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very…
$ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, …
$ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1…
$ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, …
$ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 33…
$ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87…
$ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78…
$ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49…
# To investigate the relationship between price and carat
diamonds %>%
select(carat, price) %>%
ggplot(aes(carat, price)) +
labs(title = "Relationship between price and carat of diamonds",
subtitle = "The larger the diamond, the higher the price of diamonds") +
geom_point() +
theme_classic()
# To investigate the relationship between cut and price
diamonds %>%
select(cut, price) %>%
ggplot(aes(x = cut, y = price)) +
geom_boxplot(notch = T) +
labs(title = "Relationship between cut and price of diamonds",
subtitle = "There is a weak negative relationship between cut and price.") +
theme_classic()
# Relationship between color and price
diamonds %>%
select(color, price) %>%
ggplot(aes(x = fct_rev(color), y = price)) +
geom_boxplot(notch = T) +
labs(title = "Relationship between color and price of diamonds",
subtitle = "There is a weak negative relationship between cut and price.",
caption = "The scale of color goes from D (best) to J (worst)",
x = "Color",
y = "Price") +
theme_classic()
# Relationship between price and clarity
diamonds %>%
select(clarity, price) %>%
ggplot(aes(x = clarity, y = price)) +
geom_boxplot(notch = T) +
labs(title = "Relationship between clarity and price of diamonds",
subtitle = "There is a weak negative relationship between clarity and price.",
caption = "The scale of clarity goes from I1 (worst) to IF (best)",
x = "Clarity",
y = "Price") +
theme_classic()
Carat is the best predictor of diamond prices.
#Comparing carat and cut
diamonds %>%
select(carat, cut) %>%
ggplot(aes(carat, cut)) +
geom_boxplot() +
labs(title = "Relationship between cut and carat size",
subtitle = "Bigger diamonds may not be of the best cut.") +
theme_classic()
One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlier values”. One approach to remedy this problem is the letter value plot.
The letter value plots give more insights on how the data is distributed beyond the quantiles. Larger boxes mean that a larger proportion of the data falls within the boxed area.
diamonds %>%
ggplot(aes(price)) +
geom_histogram(col = "black", fill = "deepskyblue4") +
labs(title = "Histogram showing price distribution of different cuts of diamonds",
subtitle = "Histograms are good for comparing the overall shape of distributions across categories") +
facet_wrap( ~cut, ncol = 1, scales = "free_y") +
theme_classic()
diamonds %>%
ggplot(aes(cut, price)) +
geom_violin(fill = "orange") +
labs(title = "Violin plot showing distribution of prices for different cuts of diamonds",
subtitle = "Violin plot shows the full distribution of data, and are useful if the distribution is multimodal.") +
theme_classic() +
coord_flip()
To visualize the covariation between categorical variables, dplyr can be used to compute the count, and then visualize using geom_tile()
diamonds %>%
count(color, cut)
# A tibble: 35 x 3
color cut n
<ord> <ord> <int>
1 D Fair 163
2 D Good 662
3 D Very Good 1513
4 D Premium 1603
5 D Ideal 2834
6 E Fair 224
7 E Good 933
8 E Very Good 2400
9 E Premium 2337
10 E Ideal 3903
# … with 25 more rows
# visualize
diamonds %>%
count(color, cut) %>%
ggplot(aes(x = color, y = cut)) + # use the variable with longer labels on y axis
geom_tile(aes(fill = n)) +
scale_fill_gradient(low = "white", high = "deepskyblue4") # manual
To show the distribution of cut within color, the proportion of each cut within a color can be calculated.
diamonds %>%
count(color, cut) %>%
group_by(cut) %>%
mutate(proportion = n/sum(n)) %>%
ggplot(aes(color, cut)) +
geom_tile(aes(fill = proportion)) +
scale_fill_gradient(low = "white", high = "deepskyblue4") # manual
A scatter plot may be used to visualize the co-variation between two variables.
If the dataset is too big, one can bin one continuous variable such that it acts like a categorical variable, and then use boxplot or frequency polygon to visualize.
https://jrnold.github.io/r4ds-exercise-solutions/
For attribution, please cite this work as
lruolin (2021, May 1). pRactice corner: Exploratory Data Analysis. Retrieved from https://lruolin.github.io/myBlog/posts/20210501_Tidyverse Chap 5 - ggplot2/
BibTeX citation
@misc{lruolin2021exploratory, author = {lruolin, }, title = {pRactice corner: Exploratory Data Analysis}, url = {https://lruolin.github.io/myBlog/posts/20210501_Tidyverse Chap 5 - ggplot2/}, year = {2021} }